| Reg. No. |  |  |  |  |  |  |
|----------|--|--|--|--|--|--|



## VI SEMESTER B.TECH. (COMPUTER SCIENCE ENGINEERING) END SEMESTER EXAMINATIONS, APRIL 2018

## PARALLEL COMPUTER ARCHITECTURE AND PROGRAMMING [CSE 3202] REVISED CREDIT SYSTEM (20/04/2018)

Time: 3 Hours MAX. MARKS: 50

## Instructions to Candidates:

❖ Answer **ALL** the questions.

❖ Missing data may be suitably assumed.

| Q. No. | Question | Marks |
|--------|----------|-------|
| 1A. | Discuss why applications will continue to demand increased speed. Use a few examples to justify your discussion. | 3M |
| 1B. | What is a Very Long Instruction Word (VLIW) instruction? Explain. List the advantages and disadvantages of VLIW architecture that increase the amount of computation. | 4M |
| 1C. | A CUDA program contains a kernel function named CUDAStrCopy, and any error in this kernel must be checked. Write a segment of that program to display the error message, and explain the segment you wrote. | 3M |
| 2A. | Define a communicator in MPI. Discuss how to obtain communicator information such as rank and size. | 2M |
| 2B. | Write an MPI program that meets the following requirements. The master task initializes an array and then distributes an equal portion of that array to the other tasks. After the other tasks receive their portion of the array, they perform an addition operation on each array element. They also maintain a sum for their portion of the array. The master task does likewise with its portion of the array. An MPI collective communication call is used to collect the sums maintained by each task. Finally, the master task displays the global sum of the array elements. Assume that the number of elements is evenly divisible by the number of tasks. | 4M |
| 2C. | With respect to MPI error handling, discuss the predefined default error handler, the predefined error handlers, the error codes that the MPI standard defines, and error classes. Write an MPI program that invokes them to show how error handling is performed in MPI. | 4M |
| 3A. | Write a kernel in OpenCL to calculate the approximate value of $\pi$. | 4M |
| 3B. | Write the part of the main program in OpenCL from kernel creation up to launching the kernel of Q.3A on the GPU. In your code, verify whether the kernel is successfully launched or not. | 4M |
| 3C. | Write a note on the pros and cons of simultaneous multithreading. | 2M |
| 4A. | Write a CUDA program that adds two input matrices MATA and MATB and stores the result in matrix MATC, using 2D grids and 2D blocks. Assume that every element of MATC is a two-digit number. Then modify MATC by swapping the digits of every element, using a 1D grid and 1D block. | 5M |
| 4B. | How do you find the execution time of a kernel in CUDA? Give an example program. | 3M |
| 4C. | You have to create 6 blocks, each containing 8 threads. With the help of a block diagram, show how you can create this using a 2D grid of 3D blocks. | 2M |
| 5A. | Explain what the raster operation (ROP) stage performs in a fixed-function graphics pipeline. Discuss, with a diagram, the aliased and anti-aliased effects on a triangle geometry when performing the final raster operation on pixels. | 3M |
| 5B. | Define the Compute to Global Memory Access (CGMA) ratio, which determines the performance of a CUDA kernel. Should the CGMA ratio be high or low? How do you achieve such a CGMA ratio in CUDA? | 3M |
| 5C. | Discuss the dynamic process group API functions that are built on top of the core PVM routines.                                                                                                                                                                                                                          | 4M |

\*\*\*\*\*\*